An Alternate Approach on Statistical Single-label Document Classification for Greek Newspaper Articles
Abstract
Text classification is one of the most important areas of machine learning. It enables a range of tasks, among them email spam filtering, text summarization and context identification. It has been one of the most intriguing tasks in computer theory, as it represents the projection of human thought in machine form. Classification theory proposes a number of different techniques based on statistics (the Naive Bayes Classifier (NBC), language models (LM)), artificial intelligence (neural networks) and decision trees, among others. Classification systems are typically divided into single-label and multi-label categorization systems, according to the number of categories they assign to each classified document. In this paper, we present work in the area of single-label classification that resulted in a statistical classifier based on the Naive Bayes assumption that word occurrences across a document are statistically independent. Our algorithm takes cross-category word occurrence into account when deciding the class of a random document. Moreover, instead of estimating word co-occurrence in deciding a class, we estimate each word's contribution to a document belonging to a class. According to our results, this approach outperforms other statistical classifiers such as the Naive Bayes Classifier and language models.
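For orientation, the Naive Bayes baseline that the abstract compares against can be sketched as follows. This is a minimal illustration of a standard multinomial Naive Bayes classifier with add-one smoothing, not the authors' proposed algorithm; the function names, the toy corpus and the English tokens are all assumptions made for the example.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(labeled_docs):
    """Collect per-class counts from (tokens, label) pairs (hypothetical helper)."""
    class_doc_counts = Counter()        # number of training documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocabulary = set()
    for tokens, label in labeled_docs:
        class_doc_counts[label] += 1
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return class_doc_counts, word_counts, vocabulary


def classify(tokens, model):
    """Return the single label with the highest log-posterior score."""
    class_doc_counts, word_counts, vocabulary = model
    total_docs = sum(class_doc_counts.values())
    vocab_size = len(vocabulary)
    best_label, best_score = None, -math.inf
    for label, doc_count in class_doc_counts.items():
        # Class prior, in log space.
        score = math.log(doc_count / total_docs)
        total_words = sum(word_counts[label].values())
        for token in tokens:
            if token not in vocabulary:
                continue  # ignore words never seen in training
            # Independence assumption: each word adds its own log-likelihood.
            # Add-one (Laplace) smoothing avoids log(0) for unseen pairs.
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy corpus, invented for illustration only.
toy_corpus = [
    (["goal", "match", "team"], "sports"),
    (["team", "coach", "goal"], "sports"),
    (["vote", "election", "party"], "politics"),
    (["party", "minister", "vote"], "politics"),
]
model = train_naive_bayes(toy_corpus)
predicted = classify(["team", "goal", "vote"], model)
```

Each word contributes to a class score independently of the others, which is the independence assumption the abstract refers to. The cross-category weighting that the paper proposes is not reproduced here, since the abstract does not specify it.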
Related resources
A New Document Embedding Method for News Classification
Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. A large volume of news spreads on the web. A text classifier can categorize news automatically, which facilitates and accelerates access to the news. The first step in text classification is to represent documents in a suitable way t...
A New Text-Mining Method for Extracting User Context Information to Improve the Ranking of Search Engine Results
Today, the importance of text processing and its uses is well known among researchers and students. The amount of textual, documentary material increases day by day, so we need useful ways to store it and retrieve information from it. For example, search engines such as Google, Yahoo and Bing need to read a great many web documents and retrieve the ones most similar to the user ...
The Discursive Construction of Ethnic Identities: The Case of Greek-Cypriot Students
This study examines how Greek-Cypriot students aged 12 to 18, an understudied group of students, construct their ethnic identity in a complex setting such as Cyprus and what motivates the students in the selection of ethnic identity labels. The choice to focus on students aged 12-18 was made on the hypothesis that young children, who did not experience the 1974 war in Cyprus, may have a differe...
A Multilingual Polarity Classification Method using Multi-label Classification Technique Based on Corpus Analysis
In NTCIR-7 MOAT, we participated in four sub-tasks (opinion & holder detection, relevance judgment, and polarity classification) at two language sides: Japanese and English. In this paper, we focused on the feature selection and polarity classification methodology in both languages. To detect opinion and classify the polarity, the features were selected based on a st...
A language independent approach to multilingual text summarization
This paper describes an efficient algorithm for language-independent, generic extractive summarization of a single document. The algorithm is based on structural and statistical (rather than semantic) factors. Through evaluations performed on single-document summarization for English, Hindi, Gujarati and Urdu documents, we show that the method performs equally well regardless of the language. T...